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Abstract 

In this paper we provide a definition of pattern of outliers in con- 
tingency tables within a model-based framework. In particular, we 
make use of log-linear models and exact goodness-of-fit tests to specify 
the notions of outlier and pattern of outliers. The language and some 
techniques from Algebraic Statistics are essential tools to make the def- 
inition clear and easily applicable. We also analyze several numerical 
examples to show how to use our definitions. 

Key words: Algebraic Statistics; goodness-of-fit tests; log-linear mod- 
els; toric models. 

1 Introduction 

The detection of outliers is one of the most important problems in Statis- 
tics and it is a current research topic in the field of contingency tables and 
ca tegorical data . Some recent developments in this direction can be found 



m 



Kuhntl rt2004h . where the author describes a procedure to identify outliers 



based on the tails of the Poisson distribution and discusses the use of differ- 
ent estimators to compute the expected counts under the null hypothesis. 
A model-based approach to the detection of unexpected cell counts is the 
Configural Frequency Analysis (CFA), where the outlying counts are called 
"types" or "antitypes" if they are significantly higher or smaller with respect 
to the expected counts under a suitable model. The us e of log-linear mod- 
els for CFA was present ed in Kieser and Victor ( 19991 ) and reanalyzed 



in 



von Eve and Mairl (12008b A complete acco unt on theory a nd ap plications 
of CFA can be found in|vonEyd (120021 ) and lvon Eve et al\ (|201ol b 
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The difficulties behind the definition of outlying cell in contingency tables 
is proved by the number of different approaches. About these difficulties, and 
more generally on the old question: "Wh at a contingency table doe s say 



?" 



an interesting discussion is presented in iKateri and Balakrishnanl ((2008). 
Some basic notions and appropriate references for existing methods will be 
given later. 

The notion of outlier for univariate and multivariate continuous distribu- 
tions is a well known fact. For example, in the univariate case the outliers 
are usually detected through the boxplot or the comparison of the stan- 
dardized values with respect to the quantiles of the normal distribution. It 
should be noted that there is no unique mathema t ical d efinition of outlier, 
as pointed out for instance in Barnett and Lewis] (1994). Notice also that 
the notion of outlier should be considered as outlier with respect to a speci- 
fied probability model. For instance, in the continuous univariate case, it is 
usual to consider outliers with respect to the Gaussian distribution, leading 
to the well known three-sigma criterion. 

The notion of outlier for contingency tables has a less clear meaning. In 
fact, the random variables we consider are categorical and the cells of the 
table are counts. When we consider contingency tables, we do not define the 
outliers among the subjects, but among the counts. As the counts can be 
modelled in a simple Poisson sampling scheme, one would use the quantiles 
of the Poisson distribution in order to detect the outliers in a contingency 
table. Using a different approach, the detection of outliers can also be 
deduced from the analysis of the a d justed residuals. This approach has 
been presented in lFuchs and Kenettl (11980) to t est the presence of outliers 
in a table, while the algorithm in lSimonoff ( 19881 ) uses the adjusted residuals 
and their contribution to the chi-squared Pearson's test statistics to detect 
the position of the outlying cells. 

In the past decade, Algebraic Statistics has been a very growing re- 
search area, with major applications to the analysis of contingency ta- 
bles. Algebraic Statistics now provides an easy description of complex 
log-linear models for multi-way tables and it represents the natural envi- 
ronment to define statistical models for contingency tables with structural 
zeros, through the notion of toric models. Moreover, non-asymptotic in- 
ference is now more actual via the use of Markov bases and the Diaconis- 
Sturmfels algorithm. As gener al references on the use of Algebraic Statis- 
tics fo r con tingency tabl e s, see iPistone et al\ (|200ll ) , iPachter and Sturmfels 
(l2005l l and brton et all (120091 ) . Some specific statistical models t o stud y 
complex structures i n con tingenc y tables can be found in iRapallol (|2005h . 
Carlini and Rapallol ( 2010l ) and Carlini and Rapallo ( 2011 ). with relevant 



2 



applications in the detection of special behaviours of some subsets of cells 
(quasi-independence models, quasi-symmetry models, weakened indepen- 
dence models). 

In this paper, we use the dictionary, the reasoning and some techniques 
from Algebraic Statistics in order to study the notion of outliers in contin- 
gency tables. The outliers are defined in terms of goodness-of-fit tests for 
tables with fixed cell counts. Then, we investigate the main properties of 
the outliers and we show how Algebraic Statistics is a useful tool both to 
make exact inference for goodness-of-fit tests, and to easily describe complex 
structures of outliers. We notice that the procedure defined here is mainly 
useful as a confirmatory analysis after a detection step based, for example, 
on the analysis of the residuals. We will use this approach in the numerical 
examples, detecting the candidate outliers through the residuals and then 
testing them with the appropriate goodness-of-fit test. More details on that 
issue will be discussed later in the paper. 

The material is organized as follows. In Section [2] we recall some def- 
initions and basic results about toric models, while in Section [3] we show 
how to study a single outlying cell in the framework of toric models and we 
describe explicitly the Monte Carlo test using Markov bases. In Section U 
we present the notions of sets and patterns of outliers, and we analyze two 
real-data examples. Finally, Section [5] contains some concluding remarks 
and pointers to future works. In order to help readers with little experience 
in polynomial algebra, we have decided to focus the presentation on the sta- 
tistical ideas. Thus, in the main body of the paper we have avoided formal 
definitions whenever possible, and we have grouped in the Appendix all the 
needed technical facts from Algebraic Statistics. 

2 Some recalls about log-linear and toric models 

A probability distribution on a finite sample space X with K elements is a 
normalized vector of K non-negative real numbers. Thus, the most general 
probability model is the simplex 



A statistical model M is therefore a subset of A. 

A classical example of finite sample space is the case of a multi-way con- 
tingency table where the cells are the joint counts of two or more random 
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variables with a finite number of levels each. In the case of two-way con- 
tingency tables, where the sample space is usually written as a cartesian 
product of the form X = {1,...,/} x {1,...,J}. We will consider this case 
extensively in the next sections. 

A wide cl ass of statistic al models for contingency tables are the log- 
linear models ( Agresti . 20021 ). Under the classical Poisson sampling scheme, 



the cell counts are independent and identically distributed Poisson random 
variables with means Npi, . . . , Npx, where N is the sample size, and the 
statistical model specifies constraints on the parameters p\ , . . . ,pk- A model 
is log-linear if the log-probabilities lie in an afHne subspace of the vector space 
R^. Given d real parameters a\,...,ad, a log-linear model is described, 
apart from normalization, through the equations: 



logOfc) = ^A Kr a r 



for k = 1, . . . , K, where A is the design matrix, see Ch.6 in [Pistone et al. 



(|200lh . Exponentiating Eq. CD), we obtain the expression of the correspond 



ing toric model 

d 

r=l 

for k = 1, . . . , K, where Cr = exp(a r ), r = 1, . . . , d, are the new non-negative 
parameters. It follows immediately that the design matrix A is also the 
matrix representation of the minimal sufficient statistic of the model. 

Notice that the model representations in Eq. (pQ) and ([2]) are equivalent 
on the open simplex, but the toric representation allows us to consider also 
the boundary and, therefore, the tables with structural zeros. This issue 
will be essential in our definition of outliers. The matrix r epresenta t ion o f 
the toric models as in Eq. ([2]) is widely discussed in, e.g., iRapallol (|2007h 
and brton et all (j2009h . 



To obtain the implicit equations of the model, it is enough to eliminate 
the C parameters from the system in Eq. ([2]). In this paper, we will make 
use of the following ingredients from Algebraic Statistics: 

(i) the toric ideal Xa of a statistical toric model with design matrix A; 

(ii) the variety Va of the model; 

(iii) the Markov basis Ad a of the model. 
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To keep the exposition simple, we have collected the formal definitions of 
these objects and some basic results on them in the Appendix. We mention 
here only a few basic consequences of that results that will be used in our 
presentation. 

The toric ideal Xa of a toric model is by definition the set of polynomials 
vanishing at each point of the model. Each toric ideal is generated by a 
finite set of binomials, and thus we can write 

1 A = Ideal(gi, . . . ,gi) , (3) 

meaning that each polynomial g € Xa can be written in the form g = 
r\gi + . . . + rigi for suitable polynomials r±, . . . , r^. 

The binomials g\ , . . . , gi can be actually computed with symbolic soft- 
ware without any difficulties, at least for small- and medium-sized tables, 
and we assume such binomials as given together with the design matrix 
A. We write a binomial in vectorial form g = p a — p b meaning g = 
YlkPk k ~ WkPk ■ Notice that for strictly positive probabilities the equa- 
tion p a — p b = is equivalent to log(p a /p b ) = 0. Therefore, the vanishing of 
a binomial correspond to the vanishing of a log odds ratio and vice-versa. 
The vanishing log odds ratios associated to a design matrix can be computed 
without polynomial algebra, as they are the output of simple matrix com- 
putations. Nevertheless, we emphasize that the usefulness of the binomials 
in Definition [3] is twofold: 

• on one hand, the binomials gi,...,gi determine the statistical model 
in the closed simplex A. In fact, the variety Va associated to 1a is 
the set of points 

Va = {P = (Pi, ■ ■ ■ ,Pk) ■■ giip) = 0, . . ., gt (p) = 0} c R K 
and, therefore, we obtain the statistical model simply by normalization 

• on the other hand, the £ binomials naturally define i integer tables, 
called log-vectors, obtained by taking the exponents of the t binomials 
with the map 

g = p a — p b — > m = a — b . 

The tables mi, . . . , mi form a Markov basis Ma for the model, which 
we will use to perform non-asymptotic goodness-of-fit tests. See the 
Appendix for further details on Markov bases. 
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To conclude, the binomials can be used both to study the geometry of 
the statistical model and for the definition of a Markov basis for the non- 
asymptotic goodness-of-fit test. 

As an example in the two-way setting, the independence model for 3x3 
tables is represented by the matrix 

fl 1 1 0\ 
110 1 
110 
10 110 

And = 10 10 1 

10 10 
10 10 
1 1 
\l 0/ 

while the quasi-independence model, which encodes independence of the two 
random variables except for the diagonal cells is represented by 



/I 



A 



q— ind 



1 



1 1 0\ 

1 1 1 

1 1 

1 1 1 

10 10 10 10 

10 10 10 

1 1 

1 1 

\l lj 

The last three columns of A q _- m( i force the diagonal cells to be fitted exactly. 
For further details on the quasi-independence models, see Bishop et al. ( 19751 ). 
The equations of the independence model with design matrix And is the set 
of all 2 x 2 minors of the table of probabilities, i.e., 

l Aind = Ideah>i i: LP2,2 ~Pl,2P2,l, Pl,lP2,3 -Pl,3P2,l, Pl,lP3,2 ~ Pl,2P3,l, 

Pl,lP3,3 ~ Pl,3P3,l, Pl,2P2,3 ~ Pl,3P2,2, Pl,2P3,3 ~ P3,2P2,3, (4) 
P2.1P3.2 ~P3,lP2,2, P2,lP3,3 ~ P3,lP2,3, P2,2P3,3 ~ P3,2P2,3) , 

while for the quasi-independence model from the matrix A q _i nc j we have 
only one binomial: 



1a„ 



Ideal(j>i i2 P2,3P3,l - Pl,3P3,2P2,l) ■ 
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Remark 1 We point out that the independence model can be described i n 
terms of vector spaces by 4 linearly independent log- vectors ( Agresti . 2002), 
and typically one can use the log-vectors of the 4 adjacent minors, but to 
have a Markov basis we need all the 9 binomials in Eq. 

Notice that, from the point of view of the statistica l models , a fixe d cell 
count has the same behaviour as a structural zero. See iRapallol (120061 1 for a 
discussion on this issue. This fact suggests that outliers can be modelled in 
the framework of statistical models with structural zeros, as we will make 
precise in the following section. The use of structural zeros to model contin- 
gency tables with complex structure is presented in IConsonni and Pistone 
((20071 ) under the point of view of Bayesian inference. 



Remark 2 In the special case of independence model for two-way tables, 
the use of 2 x 2 minors as in Eq. (j4]) to detect outliers was implemented in 
Kotze and Hawkins! (|1984l ). We also mention that the connections between 
the implicit equations of the model and the adjusted residuals are known at 
least in the simple case of the independ ence model for two-way table, see 
for instance iTsumoto and Hiranol (|2007l ) . 



3 Outliers 

Example 1 Let us consider the following synthetic contingency table: 



/ 



n 


2 


2 


2\ 


2 


2 


2 


2 


2 


2 


2 


2 


\3 


2 


2 


V 



(5) 



Under the independence model, it seems that the cell (1, 1) could be an 
outlier. 

With the approach presented in Fuchs and Kenett (jl980l ). the observed 
contingency table / is the realization of a multinomial distribution and the 
authors analyze the adjusted residuals under the independence model 



Z. 



fjj - fi,+f +J /N 



i, j 



y/f ii+ (N-f i>+ )f +ij (N-f +ij )/N* 



for i = 1, . . . , I and j = 1, . . . , J, where ./V is the sample size and /j + and 
f+j are the row and column sums, respectively. To check the presence of 
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outlying cells, the authors use the test statistics Z = maxjj \Z{j\ and they 
find suitable approximations for the two-sided a-level critical value, using 
the standard Normal distribution. The use of the a djusted residuals to 
detect outliers was first described in jHabermanl (Il973h . However, we warn 
that the test in iFuchs and Kenettl ()1980l ) is a global test and it is not useful 
to detect the position of the outliers in the table. 

On the other hand, the approach described in iKuhntJ (j2004h is based 
on the computation of the ML (or L\) estimate of the mean of the Poisson 
distributions for the cell counts, and then a cell is declared as outlier if the 
actual count falls in the tails of the appropriate Poisson distribution. 

Let us analyze the observed table / above under the two app r oache s 
described here. Using the adjusted residuals as in lFuchs and Kenett (ll980h . 
the value of the test statistics is z = 1.5670 (the highest adjusted residuals), 
while the critical value at the a = 5% level is 2.9478, showing tha t there 
i s no evidence of outlying cells. Under the Poisson approach as in iKuhnt 
(2004), we find that the observed value in the cell (1,1) is not considered an 
outlier at the 5%-level, both using the standard ML estimate fx t i = 4.7895 
(outlier region [9, +oo)), and using the more robust L\ estimate f\ t \ = 3.5 
(outlier region [8, +oo)). 



As mentioned above, we adopt here a different point of view to set up the 
definition and the detection of the outliers in a contingency table. We define 
them using a model-based approach with appropriate goodness-of-fit tests 
for the comparison of two nested models. The s tarting point is similar t o 
the definition of types and antitypes in CFA, see Kieser and Victor ( 19991 ). 
but after the first definitions we will use Algebraic Statistics to understand 
and generalize the notion of outlier. 

Given a contingency table with K cells, let us consider a statistical toric 
model for the table. The model has the expression: 



Pk 



IK 



(6) 



for all k = 1, . . . , K. This model with matrix representation A will be named 
as the base model. Moreover, let a € (0, 1). 



S 



Definition 1 The cell h, h G {1, . . . , K} is an a- level outlier with respect 
to the base model if the model 



f \{ r Xr k ' r for k + h 



Pk = < 



{ Ur (> r Cl s) for k = h 



(7) 



(s) 

is significantly better than the base model at level a, where Q is a new 
non-negative parameter. 

This means that we compare two toric models: 

• the base model in Eq. ([6]) with matrix representation A; 

• the model in Eq. ([7]), whose design matrix is 

A=[A\ I h ] 

where Ih is the indicator vector of the cell h: Ih is a vector of length 
K with all components equal to but the h-ih component equal to 1. 

Notice that we do not test the goodness-of-fit of the model in Eq. ([7]) , 
but we only compare it with the base model. 

To avoid trivialities in Definition Q3 we suppose that the cell h is not a 
component of the sufficient statistic of the base model, i.e., we suppose that 
the matrices A and A satisfy the relation: rank(A) = rank(yl) + 1. In fact, 
if rank(j4) = rank(j4), then the count in the cell h is already a component of 
the sufficient statistic of the base model and the goodness-of-fit test becomes 
useless. 

(s) 

From the point of view of toric models, the new parameter Q ' imposes 
the exact fit of the candidate outlier h. Although it is possible to find easy 
algebraic relations between the ideal Ta of the base model and the ideal 2^, 
we focus here on the geometric analysis of the statistical models. In terms of 
varieties, the variety Va is a subset of V^. This follows from the proposition 
below. We will use it also in the next section, thus we state the result in a 
general setting. 

Theorem 1. Let A\ and A<i be two integer non-negative matrices with K 
rows, and let Im(^4i) and Im(yl2) be their images, as vector spaces in W K . 
I/Im(^i) C Im(A 2 ) ; then Va 1 C Va 2 - 
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Proof. By virtue of Proposition [2] in the Appendix, we have to show that 
Ia 2 C Ia 1 - Let g be a polynomial in Za 2 - Then, 



g = ngi + ■■■ + ng t 

where {g±, . . . , gi} is a system of generators of Ia 2 and r%, . . . , ri are poly- 
nomials. 

From Theorem [2] in the Appendix, g%, . . . , g% are binomials and their log- 
vectors (see Definition [9] in the Appendix) mi, . . . , mi are in ker^A^)- As 
ker(^2) C ker(j4^), we have also that g £ X / 4 1 . This proves the result. □ 

The inclusion Va C Vj follows from Theorem [1] with A\ = A and A2 = 

I. 

To actually check if a cell is an outlier, it is enough to implement the 
goodness-of-fit test in D efinition [TJ T his test can be done using the log- 



likelihood ratio statistic (jAgrestil . 12002) . page 591). The test statistic has the 
expression 



k=l 



fok , 



where /ofc and f\j. are the maximum likelihood estimates of the expected 
cell counts under the base model with design matrix A and the model with 
design matrix A, respectively. The value of G 2 must be compared with the 
appropriate quantiles of the chi-square distribution with 1 df. 

Alternatively one can make exact inf erence via Markov bases and the 
Diaconis-Sturmfels algorithm (see Ch.l in lDrton et all ( 20091 )). 



Given an observed contingency table / € N K and a Markov basis Ma for 
the base model, one can apply the Diaconis-Sturmfels algorithm by sampling 
B contingency tables from its reference set 

F A {f) = {/' G n K : A t f = A'f} . 

The reference set is the set of all contingency tables with the same value of 
the sufficient statistic A t f as the observed table. The relevant distribution on 
J~A(f) is the hyper geometric distribution 'H(f'), and the explicit expression 
of this distribution is 



E fe , 4( /) 1 /n,i/(/ i 



See Drton et al. (120091 1 for details on the derivation of this distribution. To 



actually sample from the reference set with the prescribed distribution, we 
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implement a Metropolis-Hastings Markov chain starting from the observed 
table. At each step: 

1. let / be the current table; 

2. choose with uniform probability a move m G Ma an d a sign e = ±1 
with probability 1/2 each; 

3. define the candidate table as /+ = / + em; 

4. generate a random number u with uniform distribution over [0, 1]. If 
/+ > and 

mm < 1, — — > > u 

then move the chain in / + ; otherwise stay at /. 

The use of a Markov basis as set of moves ensures the connectedness of the 
Markov chain. The proportion of sampled tables with test statistics greater 
than or equal to the test statistic of the observed one is the Monte Carlo 
approximation of p- value of the log-likelihood ratio test. 



Example 2 Analyzing the contingency table in Example Q] with a Monte 
Carlo approximation based on B = 10, 000 tables we obtain an approxi- 
mated p-value 0.1574, showing that there is no evidence to conclude that 
the cell (1, 1) is an outlier. In this example, the asymptotic p-value based 
on the chi-squared approximation is 0.0977, with a noteworthy difference 
with respect to the Monte Carlo approach. Notice that in similar problems 
the asymptotic approximation dramatically fails. To see this, consider the 
observed table 





2 


2 


2 ^ 


2 


2 


2 


2 


2 


2 


2 


2 


V3 


2 


2 


V 



r 



This table differs from the first example in Eq. (|SJ) only in the first cell. 
Here, the cell (1, 1) is an antitype with an observed count less than the 
expected under independence, while in Eq. © the cell (1, 1) was a type. 
For this table /', the Monte Carlo p- value is 0.1856, while the corresponding 
asymptotic approximation is 0.0522. 

All the simulations pres e nted i n this paper has been performed in R, see 
R Development Core Team ( 20ld ) togethe r with the gl im package to make 
inference on generalized log-linear models dPuffvl . l20ld ). 
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Remark 3 From the discussion in Example Q] one sees that we have used 
our procedure only for the confirmatory step. Nevertheless, in the simple 
case of a single outlier the test can also be used to detect an outlier. Is is 
enough to run the test once for each cell. 

Finally, we remark that in many cases the computation of a Markov 
basis Ma for the base model does not need explicit symbolic computa- 
tions. In fact, for several statistical models, such as independence, symme- 
try, qua^independence, a Markov b a sis ha s been computed theoretically, 
see Drton et al. ( 20091 ) and Rapallol (j2003l ). For instance, our numerical 



example in this section considers the independence model as base model 
and a suitable Markov basis is formed by the 36 basic moves of the form 
+1 -1 N 



, , for all 2 x 2 minors of the table. 
-1 +1/ 

In view of the connections between Markov bases and varieties, this 
example is quite simple from the point of view of Geometry. In fact, the 
variety of the base model is described by the vanishing of all 2 x 2 minors of 
the table of probabilities. In the same way, it is easy to see that the variety 
of the model with one outlier is described by the vanishing of the 27 2 x 2 
minors not involving the (1, 1) cell. 

4 Sets and patterns of outliers 

Definition Q] can be easily extended to a set of outliers. 



Definition 2 The cells hi,...,h m form an a- level set of outliers with 
respect to the base model if the model 

{]!,. Cr k ' r for k 7^ hi, . . . ,h m 

(8) 
Y\ r (r k - r (l s) for k = h 1 ,...,h m 

(s) (s) 

is significantly better than the base model at level a, where Q , . . . , Q are 
m new non-negative parameters. 

In analogy with our previous analysis, notice that the model in Eq. ([8]) 
has matrix representation 

A = [A\I hl | ••• |4J, 
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where , . . . , Ih m are the indicator vectors of the cell h% , . . . , h m respec- 
tively. 

Also in this definition, to avoid trivialities, we suppose that the cells 
hi, ... , h m are not components of the sufficient statistic of the base model, 
i.e., we suppose that rank(^4) > rank(A). It is clear that the difference 
rank(^4) — rank(^4) is just the number of degrees of freedom of the goodness- 
of-fit test. The test procedure can be performed with the same technique as 
for a single outlier. The algorithm is essentially the same as in Section [3] for 
a single outlier. 



Example 3 Let us consider the independence model for 4 x 4 tables as the 
base model, as in the previous discussion. Now, we look at the 8 cells on the 
diagonal and the anti-diagonal as the set of outliers. The ideal of the base 
model is generated by the 36 2 x 2 minors of the table of probabilities, while 
computation of the ideal without the 8 variables px,li • • • >P4,4>2 5 1,4> • • • >P4,i 
gives an ideal generated by the 2 binomials: 

-Pl,3P4,2 +Pl,2P4,3, -P2,4P3,1 + P2AP3A ■ 

When the dimensions of the table increase, the toric ideals become more 
complicated. For instance, the same problem as above for 5x5 tables yields a 
base model generated by the 100 2x2 minors of the table of probabilities, and 
the toric ideal without the 9 variables pi t i, . . . ,P5, 5,^1,5, • • • ,Ps,i is generated 
by 28 binomials: 10 binomials of degree 2 of the form —pi, 4^3,2 + Pi,2P3,4) 
and 18 binomials of degree 3 of the form ^3,5^4,3^5, 2 — P3,2Pi,5P5,3- 

As mentioned in the Introduction, one among the key points of Algebraic 
Statistics lies in the possibility to make the description and the meaning of 
log-linear models easier. Thus, we can enrich the base model in many ways. 



Definition 3 The cells hi,... , h m form an a-level pattern of outliers with 
respect to the base model if the model 



Pk 



n r c^ r c (p) for k = hi,...,K 



for k ^ hi, . . . ,h r , 



is significantly better than the base model, where £W is a new non-negative 
parameter. 
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To avoid trivialities in Definition [31 we suppose that the indicator vector 
of the cells hi, ... , h m is not a component of the sufficient statistic of the 
base model, i.e., we suppose that the matrices A and A satisfy: rank(^4) = 
rank(^) + 1. 

Remark 4 Notice that in Definition [3] the outlying cells in a pattern are 
characterized by a single parameter (( p \ This means that we assume a 
common behaviour of that cells. 

As an immediate consequence of Theorem [H we have the following result 
about the connections between sets and patterns of outliers. 

Proposition 1. Let h\, ... , h m be m cells. The model with hi, ... , h m as a 

set of outliers contains the model with hi, ... , h m as a pattern of outliers. 

It follows that the definition of set of outliers in Definition [2] is stronger 
than the definition of pattern of outliers. On the other hand, the notion of 
pattern of outliers may help in finding parsimonious models. 

Remark 5 In the case of sets and patterns of outliers, the procedure pre- 
sented in this paper is confirmatory, and a preliminary step is needed in 
order to select the potential outliers. This step can be done through the 
analysis of the residuals under the base model. We follow this approach in 
the numerical examples below. 



Example 4 The definitions of set of outliers and pattern of outliers are very 
flexible and can be combined in many way s. In order to show this f eature, 
we reconsider the following data analyzed in von Eye and Mair ( 20081 ) about 



the size of social network. The sample is formed by 516 individuals, classified 
by marital status (M = 1 married, M = 2 not married), gender (G = 1 male; 
G = 2 female), and size of social network (S = 1 small, 5 = 2 large). The 
8 cell counts are listed in Table [U together with the expected cell counts / 
and the Pearsonian residuals (/ — /)//). 

As a base model, we use the complete independence model, which can 
be written in log-linear form (with the usual log-linear notation) as: 
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M 


G 


5 


/ 


/ 


(/ " /)// 


1 


1 


1 


48 


38.9 


1.45 


1 


1 


2 


87 


38.1 


7.93 


1 


2 


1 


5 


38.9 


-5.44 


1 


2 


2 


14 


38.1 


-3.90 


2 


1 


1 


78 


91.6 


-1.42 


2 


1 


2 


45 


89.4 


-4.70 


2 


2 


1 


130 


91.6 


4.02 


2 


2 


2 


109 


89.4 


2.07 



Table 1: Data on social network size. 



The ideal of this base model is: 

Ideal(pi i2 ,lP2,l,l -Pl,l,lP2,2,l,Pl,2,lP2,l,2 -Pl,l,2P2,2,l, 
-Pl,2,2P2,2,l + Pl,2,lP2,2,2, -P2,1,2P2,2,1 + P2,l,lP2,2,2, 
-Pl,l,2P2,l,l + Pl,l,lP2,l,2,Pl,2,2P2,l,l ~ Pl,l,2P2,2,l , 
Pl,2,2P2,l,2 - Pl,l,2P2,2,2, ~Pl, 1,2P2,2,1 + Pl,l,lP2,2,2 , 
-Pl,l,2Pl,2,l +Pl,l,lPl,2,2) • 

Thus, a Markov basis for this model is formed by 9 moves. A quick inspection 
of the residuals suggests that the cells (1,1,2) and (2,2,1) are potential 
types, while the cells (1,2, 1), (1,2,2) and (2, 1,2) are potential antitypes. 

If one would run a test for each of the previous cells as in Definition 
[H the approximated Monte Carlo p-values are in all cases. Notice also 
that in this example the definition of set of outliers as in Definition [2] is not 
helpful, as the corresponding model become saturated. However, if we run 
the Monte Carlo test as in Definition [3] with these 5 cells as a unique pattern 
of outliers, we obtain a p-value 0.1411, showing that the 5 cells do not have 
a common behaviour, but the test with two patterns of outliers, namely 
the potential types and antitypes separately, exhibits a p-value 0.0001, with 
strong evidence that the cells in the two patterns {(1, 1, 2), (2, 2, 1)} and 
{(1, 2, 1), (1, 2, 2), (2, 1, 2)} have a homogeneous behaviour in deviating from 
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the base model. The design matrix for this model is 

0\ 

1 
1 
1 
' 

1 

1 
0/ 

where the first 4 columns of A correspond to the parameters of the base 
model, while the last two columns correspond to the two parameters addi- 
tional parameters of the model with two patterns of outliers. In this example, 
we are able to describe the outlying cells with only two additional parame- 
ters. The interpretation of this model could be that the three types and two 
antitypes have common causes, but such an interpretation would require a 
more detailed data analysis and is beyond the scope of this paper. Here, we 
limit ourselves to provide a mathematical description of the outliers. 

We note that the model with two patterns of outliers has a less clear geo- 
metric description with respect to the base model. In fact, the corresponding 
ideal is: 

Ideal(-pi t 2,2P2,l,l + Pl,l,lPl,2,lP2,l,2P2,2,2, 
-P1,1,2P1,2,1P!,1,1 + Pl ) l,lP2,l,2P2,2,l,Pl,l,lPl ) 2 ) 2P2,2,l ~ Pl,l,2Pl,2,li>2,2,2, 

^1,2,2^2,1,1^2,2,1 -Pl,l,2Pl,2,lP2,l,2p! 2 , 2 ) • 
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1 
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Example 5 In this example, we show the practical applicability of o ur tech- 
nique in the case of large tables. We analyze the data presented in lAgresti 
as an exercise on logit models for multinomial responses. The con- 
tingency table, reported in Table [21 refers to a sample of residents of Copen- 
hagen. The individuals of the sample were classified according to 4 categor- 
ical variables: type of housing (H), degree of contact with other residents 
(C), feeling of influence on apartment management (I), and satisfaction with 
housing conditions (S). The table has dimensions 4x3x2x3, for a total 
of 72 cells, and S has the role of response variable. 

As base model, we use a log-linear model including the 4 main effects 
and the interactions [HS], [CS], [IS], that is, the interactions between the 
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XOWcI LUOCKS 


Low 


21 


21 


28 


14 


19 


37 




ivieumm 


34 


22 


36 


17 


23 


40 




Hip 1 !) 


10 


11 


36 


3 


5 


23 


Apartments 


Low 


61 


23 


17 


78 


46 


43 




Medium 


43 


35 


40 


48 


45 


86 




High 


26 


18 


54 


15 


25 


62 


Atrium houses 


Low 


13 


9 


10 


20 


23 


20 




Medium 


8 


8 


12 


10 


22 


24 




High 


6 


7 


9 


7 


10 


21 


Terraced houses 


Low 


18 


6 


7 


57 


23 


13 




Medium 


15 


13 


13 


31 


21 


13 




High 


7 


5 


11 


5 


6 


13 



Table 2: Data on housing conditions in Copenhagen. 

response variable and the other three variables. This model has 51 degrees 
of freedom and fits the data poorly. A Markov basis for this model is formed 
by 360 moves and its computation with 4ti2 in carried out in few seconds. 

Analyzing the residuals of this table under the base model, we note that 
there are 2 Pearsonian residuals exceeding 3 (in absolute value). The two 
cells are: 

- H = "Tower blocks", C ="Low", I = "Medium", S ="Low". The 
observed count is 34 versus a predicted count 16.62, with a Pearsonian 
residual equal to 4.263; 

- H = "Terraced houses", C ="High", I ="Low", S ="Low". The 
observed count is 57 versus a predicted count 35.58, with a Pearsonian 
residual equal to 3.590. 

(the counts of these cells are printed in bold in Table [2]). 

We consider these two cells as a set of outliers and we run the Monte 
Carlo algorithm as in the previous example. The approximated Monte Carlo 
p-value is (and the asymptotic p-value is 1.8 • 10~ 9 ). This shows that the 
proposed set of outliers is highly significant. Moreover, we note that the 
log-likelihood ratio statistic decreases from the value of 123.19 for the base 
model to 88.51 for the outlier model adding only 2 parameters. Looking at 
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the table, this means that these two cells have a special behaviour, and a par- 
ticular inspection of the above combinations could give relevant information 
on the data. 



5 Final remarks 

In this paper, we have shown how Algebraic Statistics is useful in address- 
ing the problem of outliers in contingency tables. In particular, we have 
shown the efficacy of this approach in two directions: (a) the use of non- 
asymptotic inference for statistical models to recognize outliers; (b) a simple 
and practical description of such statistical models from the point of view 
of Geometry. 

In particular, we have shown that Algebraic Statistics allows us to a sim- 
ple definition of set of outliers, patterns of outliers, and their combinations. 

Of course, the theory presented here does not exhaust all the research 
themes on this topic. Many questions remain still open, and among these 
problems we mention: the need for procedures and algorithms for the recog- 
nition of outliers; the problems of the choice of the a-level for multiple tests, 
using Bonferroni-type techniques. These problems a r e wid ely discussed in 
many articles cited above, see e.g. Kieser and Victor ( 19991 ). 



Prom the perspective of Algebraic Statistics, some interesting issues are 
yet to be explored: 

• The connections between the models studied here and the mixture 



models. Mixture models for the s 
diagonal are already considered in 



pecial case of outliers on the main 



Bocci et all lj20ld ): 



The characterization of the Markov bases for the models with outliers 
can yield useful information about the structure of the corresponding 
statistical models. Although in the case of a single pat t ern o f outliers 
some Markov bases are already computed in lHara et all (j2009h . yet the 



general case with several outliers and patterns of outliers is currently 
unexplored. 
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A Basic definitions and tools from Algebraic Statis- 
tics 

In this appendix we collect some basic facts about toric ideals and statistical 



Drton et al. ( 


2009). 


Pistone et all ( 


2001) 



Let R[p, £] = M[pi, . . . ,px, Coj Cij • • • > Cd] De t ne polynomial ring in the 
variables pi, . . . ,Pk, Cli ■ ■ ■ > Cd with real coefficients. 

Definition 4 [Polynomial ideal] An ideal X in M[p, £] is a set of polynomials 
such that for all g, h E X, g + h 6 X and for all g G X, h G C] ; 5^ € 

The Hilbert's basis theorem states that every polynomial ideal X as in 
Definition [4] has a finite set of generators {gi, . . . ,ge}, i.e., for all g G I, 
there exist r%, . . . , ri 6 C] with g = rigi + . . . + r^. In such a case, we 
write 

X = Ideal(#i, ...,gi). 
Let A be a non-negative integer matrix with K rows and d columns. 

Definition 5 [Toric model] The toric model associated to A is the set of 
probability distributions on {1, . . . , K} satisfying 

d 

Pk = Co n c^ ,r 

r=l 

for all k = 1, . . . , K. 

In the definition above, the parameter Co acts as a normalizing constant. 
As noticed in Section [2l a toric model is the extension of a log-linear model 
and the matrix A is the matrix representation of the minimal sufficient 
statistics. 

Now, define the ideal J a as the ideal generated by the set of binomials 

|p*-n^* ,r ; k=i,...,A . 
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Eliminating the £ parameters, i.e., intersecting the ideal J a with the poly- 
nomial ring M.\p] C £], we define the toric ideal associated to A. 

Definition 6 The toric ideal Xa associated to A is 

1 A = Elim(C, J A ) = n M[p] . (9) 



It is known that the toric ideal in Eq. Q is generated by a finite set 
of pure homogeneous binomials {b±, . . . ,bi}. To actually compute a set of 



(CoCoATeam 




2009) 


in 4ti2 ( 


4ti2 team. 



For toric ideals, 



The toric ideal Xa has two major meanings in Algebraic Statistics. From 
the combinatorial side, the binomials &i, . . . ,bg specify a Markov basis for 
the statistical model, while from a geometric point of view they describe the 
statistical model. 



Definition 7 Let / 6 N K be a contingency table with K cells, and let A 
be a K x d matrix. The reference set of / under A is: 



Ja(/) = {/' e N fc : A*/' = A*/| 



Definition 8 [Markov basis] A set of tables M-a = {mi, ■ ■ • m j £ 

Z^, is a Markov basis for the reference set J~A{f) if A t rrij = for all j, 
and for any pair of tables /',/" € Fa{I) there exist a sequence of moves 
(m.j 1 , . . . , m^) and a sequence of signs (ei)^i with = ±1 such that 

w w 

f = f + ^2 eim 3i and f' + ^2 eim i> - 

1=1 1=1 
for all 1 < w < W . The elements of a Markov basis are called moves. 
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Definition 9 [log-vector] Given a binomial in 

b = J[pf {k) -X\vT {k \ 

k=l k=l 

its log-vector is 

m = m + — mT € 7h K . 

Theorem 2 (Diaconis-Sturmfels). A set of vectors {mi, . . . , me} is a Markov 
basis for the toric model associated to A if and only if the corresponding bi- 
nomials b\, . . . , be generate the toric ideal 1a- 

Now, we show how the toric ideal 1a identifies the statistical toric model. 

Definition 10 The set of points 

Va = {p = (pi, • • • ,Pk) ■ g(p) = for all g G 1a} 

is the variety associated to A. 

To actually determine the variety Va, it is enough to solve the polynomial 
system bi(p) = 0, . . . , bt(p) = 0, where b\, . . . , bt is a system of generators of 
I A . 

The relations between the ideal 1a and the variety Va imply that a 
unique computational algorithm produces both the Markov basis and the 
equations defining the variety. Moreover, the following fundamental result 
holds. 

Proposition 2. Let 1a 1 and 1a 2 be two toric ideals. Then: 

l Al C 1 A2 <=> Va 2 C V Al 

Finally, the statistical toric model is formed by the probability distribu- 
tions in Va, i-e., the statistical toric model is simply Va n A. 
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